{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "a05c860c",
   "metadata": {},
   "source": [
    "# How to split text by tokens \n",
    "\n",
    ":::info Prerequisites\n",
    "\n",
    "This guide assumes familiarity with the following concepts:\n",
    "\n",
    "- [Text splitters](/docs/concepts#text-splitters)\n",
    "\n",
    ":::\n",
    "\n",
    "Language models have a token limit. You should not exceed the token limit. When you split your text into chunks it is therefore a good idea to count the number of tokens. There are many tokenizers. When you count tokens in your text you should use the same tokenizer as used in the language model."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "7683b36a",
   "metadata": {},
   "source": [
    "## `js-tiktoken`\n",
    "\n",
    ":::{.callout-note}\n",
    "[js-tiktoken](https://github.com/openai/js-tiktoken) is a JavaScript version of the `BPE` tokenizer created by OpenAI.\n",
    ":::\n",
    "\n",
    "\n",
    "We can use `js-tiktoken` to estimate tokens used. It is tuned to OpenAI models.\n",
    "\n",
    "1. How the text is split: by character passed in.\n",
    "2. How the chunk size is measured: by the `js-tiktoken` tokenizer.\n",
    "\n",
    "You can use the [`TokenTextSplitter`](https://v02.api.js.langchain.com/classes/langchain_textsplitters.TokenTextSplitter.html) like this:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "id": "4454c70e",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Madam Speaker, Madam Vice President, our\n"
     ]
    }
   ],
   "source": [
    "import { TokenTextSplitter } from \"@langchain/textsplitters\";\n",
    "import * as fs from \"node:fs\";\n",
    "\n",
    "// Load an example document\n",
    "const rawData = await fs.readFileSync(\"../../../../examples/state_of_the_union.txt\");\n",
    "const stateOfTheUnion = rawData.toString();\n",
    "\n",
    "const textSplitter = new TokenTextSplitter({\n",
    "  chunkSize: 10,\n",
    "  chunkOverlap: 0,\n",
    "});\n",
    "\n",
    "const texts = await textSplitter.splitText(stateOfTheUnion);\n",
    "\n",
    "console.log(texts[0]);"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "3bc155d0",
   "metadata": {},
   "source": [
    "**Note:** Some written languages (e.g. Chinese and Japanese) have characters which encode to 2 or more tokens. Using the `TokenTextSplitter` directly can split the tokens for a character between two chunks causing malformed Unicode characters.\n",
    "\n",
    "## Next steps\n",
    "\n",
    "You've now learned a method for splitting text based on token count.\n",
    "\n",
    "Next, check out the [full tutorial on retrieval-augmented generation](/docs/tutorials/rag)."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Deno",
   "language": "typescript",
   "name": "deno"
  },
  "language_info": {
   "file_extension": ".ts",
   "mimetype": "text/x.typescript",
   "name": "typescript",
   "nb_converter": "script",
   "pygments_lexer": "typescript",
   "version": "5.3.3"
  },
  "vscode": {
   "interpreter": {
    "hash": "aee8b7b246df8f9039afb4144a1f6fd8d2ca17a180786b69acc140d282b71a49"
   }
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}
